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ABSTRACT: We propose a flexible prior model for the parameters of binary 
Markov random fields (MRF) defined on rectangular lattices and with maximal 
cliques defined from a template maximal clique. The prior model allows higher- 
order interactions to be included. We also define a reversible jump Markov chain 
Monte Carlo (RJMCMC) algorithm to sample from the associated posterior 
distribution. The number of possible parameters for an MRF with for instance 
k x l maximal cliques becomes high even for small values of k and l. To get a 
flexible model which may adapt to the structure of a particular observed image 
we do not put any absolute restrictions on the parametrisation. Instead we 
define a parametric form for the MRF where the parameters have interpretation 
as potentials for the various clique configurations, and limit the effective number 
of parameters by assigning apriori discrete probabilities for events where groups 
of parameter values are equal. 

To run our RJMCMC algorithm we have to cope with the computationally 
intractable normalising constant of MRFs. For this we adopt a previously defined 
approximation for binary MRFs, but we also briefly discuss other alternatives. 
We demonstrate the flexibility of our prior formulation with simulated and real 
data examples. 

Key words: Approximate inference; Ising Model; Markov random fields; Reversible jump 
MCMC. 


1 Introduction 


Markov random fields (MRF) are frequently used as prior distributions in spatial statistics. 
A common situation is that we have an observed or latent field x which we model as an MRF, 


p(x\<fi), conditioned on a vector of model parameters 0. 
literature is to consider 0 as fixed, see for instance examples in 


me most common situation m tne 


Besae 

(1986 

) and 

Hurn et al. 
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020031) . but several articles have also considered a fully Bayesian approach by assigning a prior 


on <fi. A fully Bayesian model is computationally simplest when x\<p is a Gaussian Markov 
random field (GMRF) and this case is therefore especially well developed. A flexible imple¬ 


mentation of the GMR 
software, see Rue et al 


case is given in the integrated nested Laplace approximation (INLA) 


(120091) and 


Martins et ah 


(120131) . The case when the components of 


x are discrete variables is computationally much harder and therefore less developed in the 


literature. However, some artic 
case, see in particular the early 


and t 


re more recent 


es h ave con sidered the fully 


Heikkinen and Hogmanderl (1199411 an d 


Mpllcr et al. 


( 20061 ), 


Friel et al. 


Bayesian approach also in this 


Higdon et al. 


(2012) and Tjelmeland and Austad (2012) 


(120091) . lAustadl (120111) . 


(1997) 


McGrorv et al. 


MRFs is a very flexible class of models. Formally, any distribution is an MRF with 
respect to a neighbourhood system where all nodes are neighbours of each other. For the 
MRF formulation to be of any help, however, reasonably small neighbourhoods must be 
adopted. The typical choice in the literature is to assume each node to have the nearest four, 
eight or 24 other nodes as neighbours. Moreover, in the model specification it is common 
to restrict oneself to models that include interactions between pairs of nodes only. Such 
pairwise interaction priors are just token priors, unable to specify more spatial structure 
than that nodes close to each other should tend to have the same value. In the literature it 
is often argued that such token priors are sufficient in many applications, as the information 
that neighbour nodes should tend to have the same value is the information lacking in 
the observed data. In particular one typically gets much better results based on such a 
token prior than by not including any spatial prior information in the analysis at all. The 
main reason for resorting to pairwise interaction priors, in addition to the argument that 
these are good enough, is that the class of MRFs with higher-order interactions is so large 
that it becomes difficult both to select a reasonable parametric form for the prior and to 


specify associated parameter values, not to mention the specification o 


Bayesian approach is adopted. However, 


Descombes et al. 


(119951) and 


a hyper-prior if a fully 


Tjel mela nd and Besa g 


( 19981) demonstrate that it is possible to specify MRFs with higher-order interactions that 
are able to model more spatial structure than a pairwise interaction MRF, and where the 
model parameters have a reasonable interpretation. 
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In this article we consider the fnlly Bayesian approach and for simplicity we limit the 
attention to the case where the components of x are binary. Our focus is on the specification 
of a prior distribution and on simulation from the associated posterior distribution. We 
define priors both on the parametric form of the MRF and on the parameter values. To the 
best of our knowledge this is the first attempt on putting a prior on the parametric form 
of a discrete MRF. Other articles considering a fully Bayesian approach in such a setting, 
are using a fixed parametric model and put a prior on the parameter values only. One 
should note that by assigning a prior to the parametric structure of the model, including the 
number of parameters, we get an automatic model choice when simulating from the posterior 
distribution. 

To be able to define a reasonable prior it is essential to adopt a model where the param¬ 
eters have a natural interpretation. In this article we consider two ways to parametrise 


the MRF. The first approac 

in the log-linear and graphica 

l is inspired by the so-called u-paramet 

model literature for contingencv tables 

ers 

(Bis 

commonly 

lop et al. 

usee 

1975 

Dellaportas and Forster 

1999; 

Massam et ah 

2009; 

Overstall and King 

2014 

). Here the pa- 


rameters are interactions of different orders. To limit the complexity of the model is easy 
by restricting some of the parameters to be zero, but we argue that the interpretation of 
the parameters^ difficu lt. Th e seco nd parametrisation we consider is inspired by the MRF 


formulation in 


Tjelmeland and Besag (1998). The parameters then represent potentials for 


configurations in maximal cliques, and we lim it the model complexity 


configurations to have the same potential. In iTie lmeland and Besag (1199811 this grouping of 


oy restricting different 


configurations is done manually, whereas we assign a prior to the grouping so that it is done 
automatically in the posterior simulation. Thereby we do not need, for example, to specify 
apriori whether or not the field is isotropic. We argue that the interpretation of the config¬ 
uration potentials is much easier than for the interactions, and unless any particular prior 
information is available and suggest the opposite, it is natural to assume the configuration 
potentials to be on the same scale. 

To explore the resulting posterior distribution we construct a reversible jump MCMC 
(RJMCMC) algorithm (IGreenl 1199511 . To run this algorithm we have to cope with the com¬ 
putationally intractable normalising constant of the MRF. In the literature several strategies 


3 







































for handling this have been prop osed. We adopt an approximation strategy for binary 


MRFs introduced in 


Austadl (2 0111 ). where a partially ordered Markov model (POMM), see 


Cressic and Davidson (jl998|), approximation to the MRF is defined. We simply replace the 


MRF with the corresponding POMM approximation. 

The article has the following organisation. In Section [2] we discuss the two parametrisa- 
tions of binary MRFs, and in particular we identify the maximal number of free parameters 
for a model with specified maximal cliques. In Section [3] we define our prior for 0, and in 
Section 0] we discuss how to handle the computationally intractable normalising constant and 
describe our RJMCMC algorithm for simulating from the posterior distribution. In Section 
[5] we present results for one simulated data example and for one real data example. One 
additional simulated example is given in the supplementary materials. Finally, some closing 
remarks are provided in Section [6] 


2 MRF 

n this section we give a brief introduction to MRFs, see 


Cressic 

(1993 

) and 

Hurn et al. 


2003) for more details, and in particular we focus on binary MRFs and the parametrisation 


in this case. We close with one example of a binary MRF, the Ising model. This section 
provides the theoretical background needed in order to understand the construction of our 
prior distribution in Section [3l and the RJMCMC algorithm given in Section [4l 


2.1 Binary MRF 

Consider a rectangular lattice of dimension n x m, and let the nodes be identified by (i,j) 

where i — 0, — 1 and j = 0, — 1. To each node (i,j) G S — {(i,j)',i = 0, — = 

0, ..., m — 1} we associate a binary variable Xij G {0,1}, and let x = (xij] ( i,j ) G S) be the 

collection of these binary variables. We let xa = (b j) G A ) denote the collection of the 

binary variables with indices belonging to an index set ACS, and let X-(i,j) = x s\{(i,j)}- 

Associating zero with black and one with white we may thereby say the x specifies a colouring 

of the nodes. We let J\f = {_A/"o,o, •••, Af n -i,m-i} be a neighbourhood system on S, where 

Nij CS \ {(z, j)} is the set of neighbour nodes of node (i,j). We assume symmetry in the 
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neighbour sets, so if (i,j) G A/* iU , then also (£, u) G A/)j. Now, x is a binary MRF if p{x) > 0 
for all x, and fulhls the Markov property 


p(xij\x-(i,j)) =p(xij\xx it .) for all (i,j) G S. 


( 1 ) 


A clique is defined to be a set A C S, where (i,j) G Nt, u f° r all distinct pair of nodes 
(i,j), (t,u) G A, and we denote the set of all cliques by £. Note that by this definition sets 
containing only one node and the empty set are cliques. A maximal clique is defined to be 
a clique that is not a subset of another clique, and we denote the set of all maximal cliques 
by C rn . Moreover, for A G £ we let £^ denote the set of all maximal cliques that contains 
A, i.e. = {A G £ m ; A C A}. In the following we use A and A* to denote maximal cliques, 
i.e. A, A* G £ m , whereas we use A and A* to denote cliques that do not need to be maximal, 
i.e. A, A* G £. To denote an x where Xij = 1 for all (i,j) G A for some A C S and Xij = 0 
otherwise, we use 1 A = {xij = I((i,j) G A); (i,j) G S}. Thereby a colouring of the nodes 
in a maximal clique A G £ m may be specified by 1^, where A C A specifies the set of nodes 
in A that has the value one. 


According to the Hammersley-Clifford theorem (jClifford 
the distribution p(x) of an MRF can take is 


19901), the most general form 


p{x) = Zexp{U(x)} with U(x) — ^ V\(x\), (2) 

A G/Im 

where Z is the computationally demanding normalising constant, U(x) is frequently called 
the energy function, and V\(x\) is a potential function for A. A naive parametrisation of 
Va(xa) is to introduce one parameter for each possible A G C m and x\ G {0,1}I A I by setting 


unit) = 


(3) 


It is a well known fact the (f)\ parameters do not constitute a unique representation of U(x). 

Thereby, in the resulting parametric model p(x) the (J)\ parameters are not identifiable, 

meaning that different choices for the (p\ parameters may give the same model p(x). For 

example, adding the same value to all parameters will not change the model, as this will 

be compensated for by a corresponding change in the normalising constant Z. If the set of 

maximal cliques £ m consists of, for example, all 2x2 blocks of nodes a perhaps less obvious 
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way to change the parameter values without changing the model and neither the normalising 
constant, is to add an arbitrary value to for some (i,j) 6 A 6 C m , and to subtract 

the same value from for some A* e £ m , A* 7 ^ A for which (i,j) £ A*. 

An alternative way to represent an MRF is through a parametrisation of the cliques. 
The energy function U(x}_ is a pseudo-Boolean function and when it is given as in (|2j) 
Tielmeland and Austadl (2012) show that it can be represented as 


U(x)=Y,^ II 


x. 




( 4 ) 


Ae£ 


0j)ga 


where /3 A is referred to as the interaction parameter for clique A, which is said to be of 


Grabisch et al. 

(2000 

) and 

Hammer and Holzman 

(1992) 


of linearly independent functions of x, it is clear that the set of interaction parameters is 
a unique representation of U(x). Furthermore, in the corresponding parametric model p(x) 
th e parameters become identifiable if fixing /3® to zero (say). We note in passing that 
Besad (1974) uses the representation in (j4j) in a proof for the Hammersley-Clifford theorem. 

I 11 the following we define a set of constraints on the <f\ parameters in (J2J) and show that 
subject to these constraints there is a one-to-one relation between the <p\ parameters and the 
interaction parameters /3 A . The constrained <f>\ parameters thereby constitute an alternative 
unique representation of U{x). 

Definition 1 The constrained set of parameters are defined by requiring that <f\ = <p \* 
for all A, A* £ C m , A C Afl A*. To simplify the notation we then write <f\ = (p x . 


To understand the implication of the constraint one may again consider the situation where 
the set of maximal cliques C m consists of all 2 x 2 blocks of nodes, and focus on the two 
overlapping maximal cliques A = {(i, j — 1), (i + 1, j — 1), (i, j), (i + 1, j)} and A* = {(I, j), (i + 
1, j), ( i,j + 1), (* + l, j + 1)}. For A = {(i, j), (i + 1, j)} the constraint is that the potential 
V\(x\) for the colouring in A is the same as the potential Va*(xa*) for the colouring in 
A*. One should also note that the constraint implies that <f>\ is the same for all A £ C m , so 
in the 2 x 2 maximal cliques case the potential for the colouring od must be the same for all 
maximal cliques. 
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Theorem 1 Consider an MRF and constrain the (f) parametrisation of the potential func¬ 
tions as described in Definition Q} Then there is a one-to-one relation between {/3 A ; A G £} 
and {0 A ; A G £}. 

The proof is given in the supplemental material, and the result is shown by establishing 
recursive equations showing how to compute the /3 A, s from the 0 A, s and vice versa. 

To simplify the definition of a prior for the parameter vector of an MRF in the next 
section, we first limit the attention to stationary MRFs defined on a rectangular n x m 
lattice, and to obtain stationarity we assume torus boundary conditions. In the following we 
define the concepts of stationarity and torus boundary conditions and states two theorems 
which identify what properties the {/3 A ; A 6 £} parameters and the {</> A ; A G £} parameters 
must have for the MRF to be stationary. 

Definition 2 If, for a rectangular lattice S = {(i,j)',i — 0, ..., n — 1, j — 0,..., m — 1}, the 
translation of a node ( i,j ) G S with an amount ( t,u ) G S is defined to be 

( i,j ) © (t, u) — (i + t mod n,j + u mod m), 

we say that the lattice has torus boundary conditions. 

To denote translation of a set of nodes A C S by an amount (t, u) G S we write A © (t, u ) = 
{(i, j)ffi(t, u); (i, j ) G A}. With this notation stationarity of an MRF defined on a rectangular 
lattice with torus boundary conditions can be defined as follows. 

Definition 3 An MRF x defined on a rectangular lattice S with torus boundary conditions 
is said to be stationary if and only if p( 1 A ) = p( i A ©(t«)) f or all A C S and ( t,u ) G S. 

To explore what restrictions the stationarity assumption puts on the (3 X and 0 A parameters we 
assume the set of maximal cliques to consist of all possible translations of a given nonempty 
template set A 0 C S, i.e. 

= {A 0 © (t, u); (t, u) G S}. (5) 

For example, with A 0 = {(0, 0), (0,1), (1, 0), (1,1)} the set of maximal cliques will consist of 
all 2 x 2 blocks of nodes. One should note that with the torus boundary assumption there 
is always \C m \ = nm maximal cliques. 
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Theorem 2 An MRF x defined on a rectangular lattice S = = 0,... ,n — 1, j = 

0,... ,m — 1} with torus boundary conditions and C m given in ([5j) is stationary if and only 
if /3 X = fi x ®dP f or a n A g £, (t,u) G S. We then say that /3 X is translational invariant. 

The proof is again given in the supplemental material. We proof the if part of the theorem 
by induction on |A|, and the only if part by direct manipulation with the expression for the 
energy function. 

To better understand the effect of the theorem we can again consider the 2x2 maximal 
clique case, i.e. C m is given by (J5]) with A 0 = {(0, 0), (0,1), (1, 0), (1,1)}. The transla¬ 
tional invariance means that all first-order interactions {/ft’^f, (/, j) G S'} must be equal 
and in the following we denote their common value by fP, where the idea is that the super¬ 
script represents any node (i,j) G S. Correspondingly we use (P 3 , where the superscript 
represent any two horizontally first-order neighbours, to denote the common value for all 
{/5{( 0 , 0 ) J (0’i)}®(*’"), (t, u) G S}. Continuing in this way we get, in addition to ft fP 3 and 
the constant term /3®, the parameters ft /ft, /ft, /ft, /ft, /ft, /ft and /ft. We collect 
the eleven parameter values necessary to represent U (x) in this stationary MRF case into a 
vector which we denote by ft, i.e. 

fi = (ft fP , ft, ft /ft, /ft, /ft, ft, ft, ft, ft) . (6) 

The next theorem gives a similar result for the <f> x parameters as Theorem [2] did for the 
interaction parameters ft 

Theorem 3 An MRF x defined on a rectangular lattice S = {(i,j)]i = 0,... , n — 1, j = 
0 ,..., m — 1} with torus boundary conditions and C m given in (J5]) is stationary if and only 
if = <f x ®d’ u ) f or all A G C and ( t,u ) G S. We then say that fi x is translational invariant. 

The proof is again given in the supplemental material. Given the result in Theorem [2] it is 
sufficient to show that <f x is translational invariant if and only if /3 X is translational invariant, 
and we show this by induction on |A|. 

ft should be noted that the interpretation of the fi x parameters is very different from 
the interpretation of the fi x parameters. Whereas the fi x parameters relates to cliques A of 
different sizes, all the </> A, s represent the potential of a maximal clique A G £ m , which are all 


of the same size. The effect of the above theorem is that we get groups of configurations in 
maximal cliques that must be assigned the same potential, hereafter referred to as configu¬ 
ration sets. We let C denote the set of these configuration sets. In the 2x2 maximal clique 
case for example, we get 


c = {{[88]}, {[J8], [8i], [?8], [8?]}, {[JJ], [??]}, {[18], [81]}, 

m, {[!?]},{[?!]}, W}} 


(7) 


1 1 


11 11 


11 


We denote these sets of configurations by c°, c 1 , c 11 , c 1 , c ', c 1 , c 1 , c ', c 11 , c 1 * and c 11 
when listed in the same order as in o, where the idea of the notation is that the l’s in 
the superscript can be placed anywhere inside a maximal clique and the remaining nodes 
takes the value of zero. One should note that a similar notation can be used in other sets of 
maximal cliques. In the 3x3 maximal clique case we have for example 


11 

c 1 = 


110 

100 

000 


Oil 

010 

000 


000 

110 

100 


000 

Oil 

010 


Associated to each member c G C we thus have a corresponding parameter value 0(c) which is 
the potential assigned to any maximal clique configuration in the set c. We use corresponding 
superscripts for the 0 parameters as we did for the sets c G C. In the 2x2 maximal clique 
case we thereby get the parameter vector 


n . - M i l l n n l i li 
0 rh 1 rh U M rh 1 M M rh 1 rhll Ml Ml 


where for example 0 1 is the potential for the four maximal clique configurations in c 1 . 

We end this section with a discussion on how the above stationary MRF defined with torus 
boundary condition can be modified in the free boundary case. Using the same template 
maximal clique Ao as before, the set of maximal cliques C m now has to be redefined relative 
to the torus boundary condition case. In the free boundary case we let £, m contain all 
translations of A 0 that are completely inside our n x m lattice, i.e. 


C m = {A 0 + (' t,u)]t = -n,... ,n,u= —m,... ,m, A 0 + (t,u) C S}, 

where A 0 + (£,«) = {(* + £, j + u); ( i,j ) G A 0 }. One should note that for a free boundary MRF 

the translational invariance property of the 0 A parameters identified in Theorem |3] no longer 
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apply, and neither will such a model be stationary. However, the extra free <fi parameters 
that may be introduced in the free boundary case will only model properties sufficiently close 
to a boundary of the lattice. Our strategy in the free boundary case is to keep the same (j) 
parameter vector as in the torus case, to adopt translational invariant potential functions 
Va(1a) — f° r ah maximal cliques A e C m just as in the torus case, but to add non-zero 
potential functions for some (non-maximal) cliques at the boundaries of the lattice. Our 
motivation for this is to reduce the boundary effect and, hopefully, to get a model which is 
less non-stationary. To define our non-zero potential functions at the boundaries, imagine 
that our n x m lattice is included in a much larger lattice and that this extended lattice 
also has maximal cliques that are translations of Ao- We then include a non-zero potential 
function for every maximal clique in the extended lattice which is partly inside and partly 
outside our original n x m lattice. In such a maximal clique in the extended lattice, let A 
denote the set of nodes that are inside our n x m lattice, and let A* denote the set of nodes 
outside. As we have assumed that the maximal clique is partly inside and partly outside our 
original n x m lattice, A and A* are both non-empty and A U A* is clearly a maximal clique 
in the extended lattice. For the (non-maximal) clique A we define the potential function 

V\(x x ) = -^^H AuA *(z AUA *), (8) 

X \* 

where VAua* Ofmja* ) is the same (translational invariant) potential function we are using for 
maximal cliques inside our n x m lattice. One can note that (j8j) corresponds to averaging 
over the values in the nodes outside our lattice, assuming them to be independent, and to 
take the values 0 or 1 with probability a half for each. 


2.2 Example: The Ising model 


The Ising model (Besad 19861) is given by 


p(x) = Z exp -u ^2 1 ( x m ^ X t,u) { , 

C) 


(9) 


where the sum is over all horizontally and vertically adjacent sites, and w is a parameter 

controlling the probability of adjacent sites having the same value. We use the Ising model 
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as an example also later in the paper, and in particular we fit an MRF with 2x2 maximal 
cliques to data simulated from this model. Assuming torus boundary conditions and using 
that for binary variables we have I(xij 7 ^ x t , u ) = %i,j + Xt., u — 2xi,jXt, u , we can rewrite (|9]) as 

p(x) = Z exp < -4u ^ Xij + 2 u ^ Xijx tjU 

[ (*j)GS (■ i,j)~(t,u ) 

Thus, the can be given any value as this will be compensated for by the normalising 

constant, whereas — —4a;, / dh*.i).(*.i)©( 1 .o)} _ 2 W an d ^{(*j),(m')©(o,i)} _ 2 uj, and (3 X = 0 

for all other cliques A. The corresponding <p x parameters can then be found using the recursive 

equation (S2) identified in the proof of Theorem [TJ Using the notation introduced above for 

the 2 x 2 maximal clique case this gives (fr = 0 11 = 77 , </r = 0 11 = (j) 1 = (j) 1 = 0 1 = </> n = 
1 11 

c f> u = —uj + r) and cj) 1 = cj) 1 = — 2 a; + rj, where r/ is an arbitrary value originating from the 
arbitrary value that can be assign to /? . 



3 Prior specification 

In this section we define a generic prior for the parameters of an MRF with maximal cliques 
specified as in ([5]). The first step in the specification is to choose what parametrisation 
of the MRF to consider. In the previous section we discussed two parametrisations for 
the MRF, with parameter vectors f3 and (j), respectively. When choosing between the two 
parametrisations and defining the prior we primarily have the torus version of the MRF in 
mind. However, as the free boundary version of the model is using the same parameter 
vectors, the prior we end up with can also be used in that case. It should be remembered 
that the parametrisations using j3 and (j) are non-identihable, but that it is sufficient to add 
one restriction to make them identifiable. The perhaps easiest way to do this is to restrict 
one of the parameters to equal zero, but other alternatives also exist. We return to this 
issue below. The dimension of the (3 and (j) parameter vectors grows rapidly with the number 
of elements in the set Ao defining the set of maximal cliques. Table |T] gives the number of 

Table [1] approximately here. 

parameters in the identifiable models, which we in the following denote by N\ 01 when A 0 is 
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a k x l block of nodes. We see that the number of parameters grows rapidly with the size 
of A 0 . It is therefore natural to look for prior formulations which include the possibility for 
a reduced number of free parameters. For the [3 parametrisation the perhaps most natural 
strategy to do this is to assign positive prior probability to the event that one or several 
of the interaction parameters are exactly zero. The interpretation of the 0 parameters is 
different from the interpretation of the (3 parameters, and it is not natural to assign positive 
probability for elements of the 0 vector to be exactly zero. A more reasonable scheme here 
is instead to set a positive prior probability for the event that groups of 0 parameters have 
exactly the same value. 


In the Bayesian contingency tab les 


ample 


Dellaportas and Forster (1999), 


iterature the d param etris ation is popular, see for ex- 


Massametal 


( 20091) and Oyerstall and King (2014) 


and references therein, where the second article develops a conjugate prior for this parametri¬ 
sation. However, these results do not directly apply for an MRF where one restricts the 
potential functions to be translational invariant. More importantly, however, the various 
/3 parameters relates to cliques of different sizes and t his make s the interpretation of the 


parameters difficult. In 


Dellaportas and Forster! (119991 1 and in lOverstall and King] (120141 ) 


effort is made in order to create a reasonable multinormal prior for the j3 parameters. In 
contrast, the 0 parameters all represent the potential of a configuration of a maximal clique, 
which is all of the same size. Unless particular prior information is available and suggests 
the opposite, it is therefore natural to assume that all 0 parameters are on the same scale. 
A tempting option is therefore first to assign identical and independent normal distributions 
to these parameters, and obtain identihability by constraining the : sum of the parameters to 
be zero. Thereby the elements of 0 are exchangeable (IDiaconis and Freedma.nl 1980l ). Note 
that the f3 parameters become multinormal also in our case, see for instance (S3) in the 
supplementary materials. In the following we therefore focus on specifying a prior for 0. We 
first introduce notation necessary to define the groups of configuration parameters 0 that 
should have the same value and thereafter discuss possibilities for how to define the prior. 

To define groups of configuration set parameters that should have the same value, let 
Ci ,..., C r be a partition of the configuration sets in C with ^ 0 for i — 1 ,..., r. Thus, 
Ci fl Uj = 0 for i 7 ^ j and C\ U ... U C r — C. For each i = l,...,rwe thereby assume 0(c) to 


12 

























be equal for all c G and we denote this common value by Setting z = {{Ci,pi),i = 
1 ,..., r} we thus can write the resulting potential functions as 

V A (x A \z) = ^ ^ U c ) ■ ( 10 > 

(C,ip)£z \ c£C ) 

We define a prior on the </> parameters by specifying a prior for z. An alternative to this 
construction would be to build up {Ci, ...,C r } in a non-random fashion, constraining the </> 
parameters according to properties like symmetry and rotational invariance. However, our 
goal is that such properties can be inferred from observed data. 

Given all configuration sets, we want to assign positive probability to the event that 
groups of configuration sets have exactly the same parameter value. For instance, the three 
groups in Section 12.21 is an example of such a grouping for a 2 x 2 maximal clique. Since we 
do not allow empty groups C t , the maximum number of groups one can get is N\ 0 + 1. Our 
prior distribution for z is on the form 


p(z) = p({Ci,...., C r })p{{tpx ,..., <p r }\r) 


where p({C \,..., C r }) is a prior for the grouping of the configuration sets, while p({pi, ...,<p r }\r) 
is a prior for the group parameters given the number of groups r. Two possibilities for 
{Ci,..., C r } immediately comes to mind. The first is to assume a uniform distribution on 
the groupings, i.e. 

Pi({Ci ,..., C r }) oc const, 


meaning that each grouping is apriori equally likely. However for p(r), the marginal proba¬ 
bility of the number of groups, this means that most of the probability is put on groupings 
with approximately (W\ 0 + l)/2 groups. In fact the probability p(r) becomes equal to 


p(r) = 


g(N Ao + l,r) 


EiTGA'A. + M)’ 

where g(N Ao + 1, r) is the number of ways N Aq + 1 configuration sets can be organised into r 
unordered groups, remembering that no empty groups are allowed. The function g(iVA 0 + l, r) 
can be written as 

i r /„\ 


9( JV + 1 .’-) = riE( j j(- 1 ) 


i=0 
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and is known as the Stirling number of the second kind ([R onald L. Gr ah am 19881) . For 


the 2 x 2 maximal clique this means for instance that p(r = 1 ) = p(r = 11 ) m 10 -6 while 
p(r = 5) = 0.36. An alternative for p({C\, ..., C r }) is to make the distribution for the number 
of groups uniform. This can be done by defining the probability distribution 


p 2 ({C' 1 ,...,CV}) 


1 

(Aa 0 + l)5'(A r A 0 + 1,7’) 


With this prior a particular grouping with many or few groups will have a larger probability 
than a particular grouping with approximately (AR 0 + l)/2 groups. In the 2x2 case for 
example, the probability of the grouping where all configuration sets are assigned to the 
same group or the grouping with 11 groups is p({Ci}) = p({Ci, ..., Cii}) = 0.09, while 
the probability of a particular grouping with 5 groups is p({Ci,..., C 5 }) ~ 10 '. Observe 
however, that with both priors we have that the groups are uniformly distributed when the 
number of groups is given. As a compromise between the two prior distributions we propose 


p({Ci,C r }) cx p 1 {{C 1 , ..., CV^-XiC'r, Cr}) 


where 0 < 7 < 1 . 

As also discussed above, to get an identifiable model we need to put one additional restric¬ 
tion on the elements of 0 , or alternatively on the ipi parameters. As we want the distribution 
p(ipi,... ,p r \r) to be exchangeable we want the restriction also to be exchangeable in the ipi 
parameters, and set 

(ii) 

(C»ez 

Under this su m -to-zero restriction we assume the apriori to be independent normal with 
zero mean and with a common variance cU. This fully defines the prior for z, except that 
we have not specified values for the two hyper-parameters 7 and cr 2 p . 


4 Posterior sampling 

In this section we first discuss different strategies proposed in the literature for how to handle 

the computationally intractable normalising constant in discrete MRFs, and in particular 

discuss their applicability in our situation. Thereafter we describe the RJMCMC algorithm 

we adopt for simulating from our posterior distribution. 
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4.1 Handling of the normalising constant 


Discrete MRFs contain a computationally intractable normalising constant and this makes 
the fnlly Bayesian approach problematic. Three strategies have been proposed to circum¬ 
vent or solve this problem. The first alternative is to replace the MRF likelihood wi th a 
computationally tractable ap proximation . The early 


the pseudo-likelihood for this, 


Friel et al 


dependency approximation (RDA), and 


(2009) and 


Heikkinen anc 


McGrorv et al 


Austad ( 2011 ) and 


Odgmandeii (1199411 use 


( 2012 ) adopt a reduced 


Tielmeland and Austad (201 


construct a POMM approximation by making use of theory for pseudo-Boolean functions. 
The second strategy, used in Hi gdon et ah ( 119971 ). is to adopt an estimate of the normali¬ 
sation constant obtained by some Markov chain Monte Carlo (MCMC) procedure prior to 
simulating from the posterior, and the third alternative is to include an auxil iary variable 


sampled from the MRF p(x\<p) in the posterior simulation algorithm. Moller et ah 


the first article using the t hird approach, and the e xcha nge algorithm of 


(2006) is 


falls within the same class. 


Cairno and Friel ( 2011 ) and 


Everitt 


Murra y et al. 


( 120061 ) 


( 2012 ) adopt an approximate 


version of this third approach, by replacing perfect sampling from p(x\(f>) with approximate 
sampling via an MCMC algorithm. 

The three approaches all have their advantages and disadvantages. First of all, only the 
third approach is without approximations in the sense that it defines an MCMC algorithm 
with limiting distribution exactly equal to the posterior distribution of interest. However, 
for this approach to be feasible perfect sampling from p{x\(j>) must be possible, and computa¬ 
tionally reasonably efficient, for all values of qb. The strategy used in the second class requires 
in practice that the parameter vector 0 is low dimensional. The approximation strategy does 
not have restrictions on the dimension of 0 and perfect sampling from p(x\<p) is not needed. 
In that sense this approach is more flexible, but of course the approximation quality may 
depend on the the parametric form of the MRF and the value of 0. 

I 11 principle any of the approaches discussed above may be used in our situation, but the 
complexity of the parameter space makes the prior estimation of the normalisation constant 
approach impractical. Moreover, the accuracy of the pseudo-likelihood approximation is 
known to be quite poor, and in simulation exercises we found that perfect sampling from 
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p(x\<p) was in practice infeasible for many of the higher-order interaction models visited by 
our RJMCMC algorithm. The approximate version in Caimo and Fricl (2011) is, however, a 
viable alternative. We are thereby left with the RDA approach, the POMM approximation, 


and the strategy proposed in 


Caimo and Friel (20111). In our simulation examples we adopt 


the second of these, but the other two could equally well h ave been used. In 
our simulation examples we use also the strategy from 


Caimo and Friel (2011) to check the 


act, in one of 


approximation quality obtained when replacing the MRF with the POMM approximation. 


4.2 MCMC algorithm 


Assume that an observed binary n x m image is available. We consider this image as a 
realisation from our MRF with the free boundary conditions defined in Section [2j As a prior 
for the MRF parameters we adopt the prior specified in Section [31 The focus in this section 
is then on how to sample from the resulting posterior distribution. One should note that in 
this section we formulate the algorithm as if one can evaluate the MRF likelihood, including 
the normalising constant. This is of course not feasible in practice, so when running the 
algorithm we replace the MRF likelihood with the corresponding POMM approximation 
discussed above. 

Letting x denote the observed image, the posterior distribution we want to sample from 
is given by 

p(z\x) oc p(x\z)p(z), 


where p(x\z) and p(z) are the MRF defined by (TTOl) and the prior defined in Section [3j 
respectively. To simulate from this poste rior we adopt a reversible jump Markov chain 
Monte Carlo (RJMCMC) algorithm (Green 1995) with three types of updates. The detailed 
proposal mechanisms are specified in the supplementary materials, here we just give a brief 
description of our proposal strategies. 

The first proposal in our algorithm is simply first to propose a change in an existing 
p parameter by a random walk proposal with variance a 2 , and thereafter to subtract the 
same value from all p parameters to commit with the sum-to-zero constraint. In the second 
proposal we draw a pair of groups and propose to move one configuration set from the 
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first group to the second group, ensuring that the two groups are still non-empty. In the 
last proposal type, we propose a new state by either increasing or decreasing the number 
of groups with one. When increasing the number of groups by one we randomly choose a 
configuration set from a randomly chosen group and propose this configuration set to be a 
new group. When proposing to reduce the number of parameters with one, we randomly 
choose a group with only one configuration set and propose to merge this group into another 
group. In the trans-dimensional proposals we ensure that the proposed parameters commit 
with the sum-to-zero constrain by subtracting the same value from all ip parameters. 


5 Simulation examples 


In this section we first present an example based on a simulated data set from the Ising model, 
and thereafter present results for a data set of census counts of red deer in the Grampians 
Region of north-east Scotland. In addition, another example based on simulated data is 
included in the supplementary materials. In all the simulation experiments we use the prior 
distribution as defined in Section [3] In this prior the values of the two hyper-parameters 
and 7 must be specified. We have fixed 07 , = 10 and tried 7 = 0, 0.5 and 1. When discussing 
simulation results we first present results for 7 = 0.5. As the likelihood function we use the 
MRF discussed in Section [2] and we use 2x2 maximal cliques except in the last part of the 
real data example where we also discuss results for 3 x 3 maximal cliques. To cope with 


the computationally intra ctab le normalisin g consta nt of 


approximation strategy of iTi elmeland and Austad 020 


with a partially ordered Markov model (POMM), see 


he MRF likelihoods, we adopt the 


2). TheJVIRF is then appro ximated 


Cre_ssie and Davidson (1998), where 


the conditional distribution of one variable given all previous variables is allowed to depend 
on maximally v previous variables. We have tried different values for v and found that in 
all our examples v = 7 is sufficient to obtain very good approximations, so all the results 
presented here are based on this value of v. To simulate from posterior distributions we use 
the reversible jump MCMC algorithm defined in Section HI In our sampling algorithm we 
have an algorithmic tuning parameter cr 2 as the variance in Gaussian proposals. Based on 
the results of some preliminary runs we set a = 0.3. One iteration of our sampling algorithm 
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is defined to be one proposal of each type. Lastly we note that parallel computing was used 
in order to reduce computational time, and the technique that is used is explained in the 
supplementary materials. 


5.1 The Ising model 

We generated a realisation from the Ising model given in Section 12.21 with uj = 0.4 on 

a 100 x 100 lattice, consider this as our observed data x and simulate by the RJMCMC 

algorithm from the resulting poster ior distribution. The x was obtained using the perfect 

sampler presented in Propp and Wilson! (119961) . From the calculations in Section 12.21 we 

ii i n n i i i i 

ideally want the correct groups, (c , c 11 } {c , c , c 1 ,*: 1 , c 1 ,c 11 ,c 11 }, and {c 1 ,c 1 }, to be 
visited frequently by our sampler. Note that due to our identihability restriction in (fill) 
the configuration set parameters should be close to the values given in Section 12.21 with 
r) = uj. We run our sampler for 20000 iterations and study the simulation results after 
convergence. A small convergence study is included in the supplementary materials for the 
other simulated data set. The acceptance rate for the parameter value proposals is 19%, 
whereas the acceptance rates for the other two types of proposals are both around 1%. The 
estimated distribution for the number of groups is 94%, 5% and 1%, for 3, 4 and 5 groups 
respectively. 

In Figure |T] we have plotted the matrix representing the estimated posterior probability 
of two configuration sets being assigned to the same group. As we can see in this figure, the 

Figure |Tj approximately here. 


configuration sets are separated into 3 groups, and these groups correspond to the correct 
grouping shown in grey. About 94% of the realisations is assigned to this particular grouping, 
and almost all other groupings that are simulated correspond to groupings where the middle 
group is split in various ways, while some very few are splits of the groups {c°,c n } and 
{cfc 1 }. Every one of these alternative groupings have an estimated posterior probability 
of less than 0.5%. 

One informative way to look at the result of the simulation is to estimate the posterior 
distribution for the interaction parameters (3. Histograms and estimated 95% credibility 
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intervals for each of the parameters are given in Figure [2] As we can see, all the true values 

Figure [2] approximately here. 


of the interaction parameters are within the estimated credibility intervals, however the 
modes of the distributions for the pairwise horizontal and vertical second order interactions, 


see Figure 2(b) and 2(c), seem to be somewhat lower than the correct value. 


To study the properties of the MRF p(-\z) when z is a sample from the posterior p(z\x) 
we take 5000 samples from the MCMC run for p{z\x) and generate for each of these a 
corresponding realisation from the MRF p(-\z). To analyse these 5000 images we use six 
statistics describing local properties of the images. The statistics used and resulting density 
estimates (solid) of the distribution of these statistics are shown in Figures [3] (a)-(f). In the 

Figure [3] approximately here. 

same figures we also show density estimates of the same statistics when images are generated 
from the Ising model with the true parameter value (dashed), and when images are generated 
from the Ising model with parameter value to generated from the posterior distribution given 
our observed image x (dotted). In this last case, a zero mean Gaussian prior with standard 
deviation equal to ten is used for uj. In these figures we also see that the data we use for 
posterior sampling (black dots) of z is a realisation from the Ising model with low values for 


the number of equal horizontal and vertical adjacent sites, see Figure 3(b) and 3(c), which 


causes, as already observed above, our simulations of the second order interactions between 
horizontal and vertical adjacent sites to be somewhat lower than the true values. In fact we 
can see that the simulations from the Ising model using posterior samples for the parameter 
value closely follows that of our 2x2 model. This means that the results from our model 
is as accurate as the result one gets when knowing that the true model is the Ising model 
without knowing the model parameter. 

To evaluate the quality of the POMM approximation in this example, we also simulate 


from the posterior distribution with t 
exchange algorithm of Murray eh aL 


re same RJMCMC algorithm using the approximate 


2006), as discussed in Section 14.11 We compare in 


Figures [3] (g) and (h) the results using the POMM approximation (solid) to the results from 

the approximate exchange algorithm (dashed) using two of our six statistics. We observe 
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that the differences are minimal for these two, and indeed we get as accurate results for the 
four other statistics as well. That these two very different approximation strategies produces 
essentially the same results strongly indicate that both procedures are very accurate. 

All the above results are for 7 = 0.5, but as mentioned in the introduction of this section 
we also investigate the results for 7 = 0 and 1. For 7 = 0 the configuration sets are organised 
into 3 ( 66 %), 4 (31%) or 5 (3%) groups, and for 7 = 1 we get 3 (96%) or 4 (4%) groups. 
From these numbers we see the effect of varying 7. In particular when increasing 7 from 0.5 
to 1.0 the tendency to group more configuration sets together becomes stronger for this data 
set. 


5.2 Red deer census count data 


In this section we analyse a data set of census counts of red deer in 
of n orth-east Scotland. A full de scription of the data set is found in 


and 


he Grampians Region 


Augu stin e t ahj ( 1996!) 


Buckland and Elston ( 19931) . The data is obtained by dividing the region of interest 


into n = 1277 grid cells on a lattice and observing the presence or absence of red deer in 
each cell. I 11 our notation this is our observed image x, but in this example we also have the 
four covariates altitude, mires, north coordinate and east coordinate available in each grid 
cell. The binary data x and the two first covariates are shown in Figure |4j We denote the 

Figure [4] approximately here. 


covariate k at a location (i,j) by yij : k, j = 1,2, 3,4, and model them into the likelihood 
function in the following way 

p(x\z,d c ,y) = Zex pf V A (x A \z) + ^ x itj ^9^y idjk j, (12) 

\A e£m (iJ)eS k =l J 

where 6 C = ( 0 f,..., 67 ) are the parameters for the covariates. 

We put independent zero mean Gaussian prior distributions with standard deviation 

equal to 10 on 9^, j — 1,..., 4. In the sampling algorithm these covariates are updated using 

random walk, i.e. we uniformly choose one of the four covariates to update and propose a 

new value using a Gaussian distribution with the old parameter value as the mean and a 

standard deviation of 0 . 1 . 
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We ran our algorithm for 50000 iterations, and the acceptance rates for the parameter 

random walk proposal is 42%, the group changing proposal is 33%, the trans-dimensional 

proposal is 5%, and the covariate proposal is 48%. The posterior most probable grouping 

becomes {c 0 }, {c 1 , c 11 , c 1 , c 1 ,c 1 , c 1 , c *, c 11 , c 11 } and {c 11 } with probability 33.2%. In total 

more than 2500 different groupings are visited, and except for the posterior most probable 

grouping the posterior probabilities of all other groupings are less than 5%. The estimated 

posterior probability distribution for the number of groups becomes 43% for 3 groups, 48% 

for 4 groups, 8% for 5 groups and 1% for 6 groups. In particular, the realisations with four 

11 i n n i l 

or more groups are mostly groupings where the set {c , c , c 1 , c 1 , c 1 , c 1 , c b c 11 , c 11 } is split 
in various ways. This can also be seen in Figure [5j which shows the estimated posterior 
probability of two configuration sets being assigned to the same group. The grey blocks 

Figure |5] approximately here. 

in this figure show the estimated posterior most probable grouping described above. Next 
we estimate the posterior density for the interaction parameters, see Figure [6j As we can 

Figure |6] approximately here. 

see, most of the higher order interaction parameters becomes significantly different from 
zero, suggesting that a 2 x 2 clique system is needed for this data set. Figure [7] shows the 
estimated posterior density for the covariate parameters. As we can see from the credibility 

Figure [7| approximately here. 

intervals, all these parameters are significantly different from zero, which justifies the need 
to include them. Simulations of p(x\z, 6 C , y ) for three randomly chosen posterior samples of 
z and 6 C are shown in Figured] As we can see the spatial dependency in these realisations 

Figure |8] approximately here. 

looks similar to the data which indicates that the features of this data set are captured with 
this model. 

As discussed above, the estimated marginal posterior densities for the interaction pa¬ 
rameters in Figure [6] indicate that higher order interaction parameters are needed for this 
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data set. To investigate this further we also run a corresponding MCMC simulation with a 
prior where the spatial interaction is as in the nearest neighbour autologistic model defined 


m 


Besa g (1 9721 ). whereas the covariates are included as in (TT 2 ]h This pairwise interaction 


prior has three interaction parameters, for first-order interactions and for horisontal and ver¬ 
tical second-order interactions, respectively, and apriori we assume these three parameters 
to be independent and Gaussian distributed with zero-mean and standard deviations equal 
to ten. To simulate these three parameters we randomly choose one and propose a zero 
mean Gaussian change with standard deviation equal to 0.3 to the chosen parameter. For 
the 9j' parameters we adopt the same prior and proposals as before. For the pairwise inter¬ 
action prior and our original prior in (fT 2 ]h Figure [9] shows estimates of the resulting marginal 

Figure [9] approximately here. 


posterior distributions for the same six statistics studied in our Ising simulation example. 
For several of the statistics we see that there is a clear difference between the results for 
the two priors. The differences for the higher-order interaction statistics are perhaps less 
surprising, but one should note that the distribution of the first-order statistic in Figure [9ta) 
also changes quite much when allowing higher order interactions. One should also note that 
our 2 x 2 model fits better to the statistics of the data, shown as black dots in the figures. 

Returning to the 2x2 prior, using 7 = 0 in the prior for this data set gives the estimated 
posterior probability distribution 24%, 63%, 11% and 2% for 3, 4, 5 and 6 groups respectively, 
whereas for 7 = 1 we obtain 60%, 35% and 5% for 3, 4 and 5 groups respectively. Again 
we see that higher values of 7 results in more realisations with fewer number of groups. 
However, for all the three values of 7 the estimated most probable grouping is the same. 

We end our discussion of this data set by mentioning that some results when assuming 
a clique size of 3 x 3 is included in the supplementary material of this paper. These results 
indicate that no more significant structure is introduced in the 3x3 case for this data set. 


6 Closing remarks 


Our main focus in this paper is to design a generic prior distribution for the parameters of an 

MRF. This is done by assuming a set of maximal cliques defined from a template maximal 
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clique A 0 , but as the number of free parameters grows quickly as a function of the number 
of elements in A 0 we construct our prior distribution such that it gives a positive probability 
for groups of parameters to have exactly the same value. In that way we reduce the effective 
number of parameters, still keeping the flexibility large cliques provides. Proposal distribu¬ 
tions that enable us to simulate from the resulting posterior distributed is also presented. 
Ho weve r , to e valuate the likelihood we use a previously defined approximation to MRFs 


Austad 


201 ll) . and the trade off between accuracy and computational complexity limits in 


practice the size of the cliques that can be assumed. An alternative to approximations is 
perfect sampling flPropp and Wilsonl 19961 ). but this was in all our examples too computa¬ 


tionally intensive. A third alternative would 
a perfect sample, as described in for instance 


je to u se an MCMC sample of x instead of 


Everitt 


(2012). An issue with this approach 


is the need to set a burn in period for the sampler of x, where a too long burn in period 
would make the parameter sampler too intensive. Lastly, we illustrate the effect of our prior 
distribution and sampling algorithm on two examples. 

Our focus in this paper is on binary MRFs. It is however possible to generalise our 
framework to discrete MRFs, i.e. where x* 6 {0,for K > 2. An identifiable 
parametrisation of a discrete MRF using clique potentials can with a small effort be defined in 
a similar way to what is done in the binary case, and ones this parametrisation is established, 
the prior distribution presented in this paper can be used unchanged. The same apply to 
our sampling strategy. 

With our prior distribution the size of the maximal cliques, and thereby the number of 
configuration sets, act as a hyper parameter and must be set prior to any sampling algorithm. 
One could imagine putting a prior also on A 0 , introducing the need to construct algorithms for 
trans-dimensional sampling also for Ao- Another way to avoid the need to set the number of 
configuration sets would be to construct a prior distribution for the f3 parameters. A natural 
choice would be to construct a positive prior probability for these parameters to be exactly 
zero, and in this way the significant interactions of an MRF can be inferred from data. 
However, as discussed above, it is not clear to us how to design generic prior distributions 
for the values of these interaction parameters, as higher order interactions intuitively would 
be different from lower order interaction. Also, grouping ft parameters together in order to 
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reduce the number of parameters would, for the same reason as above, make little sense. An 
ideal solution would be somehow to draw strength from both of the two parametrisations 
in order to assign a prior distribution to both the appearance of different cliques and the 
number of free parameters. This idea is currently work in progress. 


Supporting Information 

Additional Supporting Information may be found in the online version of this article: 


Section S.l: 
Section S.2: 
Section S.3: 
Section S.4: 
Section S.5: 
Section S.6: 
Section S.7: 


Proof of one-to-one relation between <f> and (5. 

Proof of translational invariance for j3. 

Proof of translational invariance for qb. 

Details for the MCMC sampling algorithm. 

The independence model with check of convergence. 
Reed deer data with 3x3 maximal cliques. 
Parallelisation of the sampling algorithm. 
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Table 1: 


k x l 2 kl N a , 

"T5T2 4 

2x2 16 10 

2 x 3 64 44 

3 x 3 512 400 

3 x 4 4096 3392 

4 x 4 65536 57856 

The number of configurations and the corresponding number of free parameters 


N Aq when A 0 is a k x l block of nodes. 
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Figure 1: Ising model example: Estimated posterior probabilities for two configuration sets 
to be grouped together. The true grouping is shown in grey, and only probabilities larger 
than 5% are given. Note the permutation done to the ordering of the configuration sets c t . 
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(a) /3 d (-1.62, -1.38) (b) 0^ (0.69,0.81) 
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(c) /£ I (0.69,0.81) (d) /ft (-0.09,0.09) 



- 0.2 - 0.1 0.0 0.1 0.2 - 0.2 - 0.1 0.0 0.1 0.2 


(e) /3nP (-0.09,0.09) (f) /fP (-0.10,0.08) 
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(i) flB (-0.09,0.09) (j) /ffl (-0.17,0.19) 

Figure 2: Ising model example: Estimated marginal posterior distribution for the inter¬ 
action parameters. True values are shown with a black dot and estimated 95% credibility 
intervals are given for each parameter. 
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(a) g(x) = £i Xi (b) g(x) = Ei,j: vertical adjacent sites 1 ( X i = X 3 ) 
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(c) g(x) = Ei, ^horizontal adjacent sites = X i) ( d ) S(*) = EA G £ m 1 (*A = [88] ) 
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( e ) 9 (x) = Ea ec m 1 (*a = [So] ) ( f ) 9{x) = Ea e£ m 7 (*A = [ 10 ]) 



4700 4800 4900 5000 5100 5200 5300 450 500 550 600 650 


(g) g(x) = E; Xi (b) g(x) = E A e£m 1 ( x a = [ 10 ]) 

Figure 3: Ising model example: (a)-(f) Distribution of six statistics of realisations from 

our 2x2 model with posterior samples of z (solid), the Ising model with correct parameter 

value (dashed), and the Ising model with posterior samples of the parameter value (dotted). 

In (g) and (h) we compare two of these statistics with results obtained using the exchange 

algorithm with MCMC samples as auxiliary variables (dashed) instead of the approximation. 

The data evaluated with each statistic is shown with a black dot. 
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Figure 4: Red deer example: The presence/absence of red deer (left), altitude (middle), 
and mires (right) in the Grampians Region of north-east Scotland. 
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Figure 5: Red deer example: Estimated posterior probabilities for two configuration sets 
to be grouped together. The estimated most probable grouping is shown in grey, and only 
probabilities larger than 5 % are given. 
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(a) /3 d (-3.52,-2.85) 


(b) /3™ (0.92,1.77) 



(c) fB (0.91,1.77) 


(d) /fb (0.62,1.54). 



(e) /3nP (0.14,0.88) 


(f) /£P (-1.08,-0.09) 
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(g) /ZB (-1.54, -0.25) (h) /fb (-1.55, -0.28) 



- 2-1 0 1 2 - 6 - 4-2 0 2 


(i) /3C B (-0.88,0.47) (j) /£B (-3.70,0.29) 

Figure 6: Red deer example: Estimated marginal posterior distribution for the interaction 

parameters. Estimated 95% credibility interval is given for each parameter. 
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- 0.8 - 0.6 - 0.4 - 0.2 0.0 - 0.4 - 0.3 - 0.2 - 0.1 0.0 0.1 

(c) eg (-0.55,-0.27) (d) eg (-0.26,-0.05) 

Figure 7: Red deer example: Estimated marginal posterior distributions for the parameters 

of the covariates. Estimated 95 % credibility interval is given for each parameter. 



Figure 8: Red deer example: Three realisations from the likelihood for three random 
samples of 2 from the posterior distribution. 
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(a) g (x) = E* Xi (b) g(x) = Ei,*™ rtical adjacent sites T ( X i = X j) 



40 60 80 100 0 5 10 15 20 25 30 


( e ) 9(x) = Ea e£m 1 (xa = [ io ]) ( f ) g(x) = E A e£ m 1 ( x a = [ 10 ]) 

Figure 9: Red deer example: Distribution of six statistics of realisations from our 2x2 

model with posterior samples of 2 (solid), and the nearest neighbour pairwise interaction 

model (dashed). The data evaluated with each statistic is shown with a black dot. 
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S.l Proof of one-to-one relation between 0 and [3 

Theorem 1 Consider an MRF and constrain the 0 parametrisation of the potential func¬ 
tions as described in Definition 1. Then there is a one-to-one relation between {/3 A ; A £ £} 
and {0 A ; A G £}. 

Proof of Theorem 1 We prove the theorem by establishing recursive equations showing 
how to compute the /3 A ’s from the 0 A ’s and vice versa. 

Setting x = 1 A for some A £ £ into the two representations ofU(x) in (2) and (4), we 
get 

E WiA) = E e* II W 

A e£m A *e£ (i,j)€ A* 

Using (3) and Definition 1 this gives 

E ^ JnA = E ?’■ < S1 ) 

A eCm A*CA 

Splitting the sum on the left hand side into one sum over A £ £^ and one sum over A £ 
C m \ C^, and using that A D A = A for A £ C^ we get 

ia a „I 0 a + E ^ nA = E/ jA '- 

A eCm.\c^ a*ca 


1 


Solving for f> x gives 


1^ 


Y,0 y - E ^ nA - < S2 > 

_a*ca a aCm \ c ^ 

Clearly |AflA| < |A| when A G C m \C ) f l , so (IS2|) implies that we can compute all {</> A ; A G C} 
recursively from {/5 A ; A G £}. First we can compute /3 X for |A| = 0, i.e. (j) x — (jfl — fd®/\C m \, 
then all fd x for which |A| = 1, thereafter all fd x for which |A| = 2 and so on until fd x has been 
computed for all A G C. Thus, {0 A ; A G £} is uniquely specified by { f3 x ; A G £}. 

Solving (ISll) with respect to /3 X we get 


,s j =E> AnA -X> A '> ( S3 > 

A &Cm A*cA 

and noting that clearly |A*| < |A| when A* C A we correspondingly get that {/3 A ; A G £} can be 
recursively computed from {0 A ; A G £}. One must first compute /3 X for |A| = 0, i.e. [d®, then 
all (3 X for which |A| = 1, thereafter all fd x for which |A| = 2 and so on. Thereby {/3 A ; A G £} 
is also uniquely specified by {</> A ; A G £} and the proof is complete. 


S.2 Proof of translational invariance for [3 

Theorem 2 An MRF x defined on a rectangular lattice S = = 0, ...,n — l,j = 

0,... ,m — 1} with torus boundary conditions and C m given in (5) is stationary if and only 
if /3 X = fd x ®d’ u ) for a n \ g £, (t,u) G S. We then say that /3 X is translational invariant. 

Proof of Theorem 2 We first prove the only if part of the theorem by induction on |A|. 
Since 0 © (t, u) = 0 we clearly have (3® = jd®®^ u ) and thereby fd x = (d x ®d,u) fo r |^| — o. 
Now assume all interaction parameters up to order o to be translational invariant, i.e. /3 X = 
p x <s(t,u) j Qr |_^| < Q Now focusing on any A* G C with |A*| = o+l, the assumed stationarity 
in particular gives that we must have p(l A *) = Using (2) and (4) it follows that 

p x * + J2 p x = P x *® it,u) + J2 p x - 

AcA* AcA*e(t,u) 

Rewriting the sum on the right hand side we get 

p x * + = P x *® [t,u) + P mt,u \ 

AcA* AcA* 


2 





where the induction assumption gives that the two sums must be equal, and thereby j3 x * = 
completes the only if part of the proof. 

To prove the if part of the theorem we need to show that if the interaction parameters are 
translational invariant then U( 1 A ) = for any ACS and ( t,u ) G S. For U(1 A ) 

we have 


u( I- 4 ) = e ja n ^ 

Ag£ (ij')G A 

= n Si 

A eC 

_ ^Affi(i,u) 

Ag£ (i,j) GA 

= XX II li? (t ’ U) = C/(l A ® {t ’ u) ), 

Ag£ (ij) GA 


where the first equality follows from (4), the second equality is true because {A © (t,u)\ A G 
£} = £ for any (t,u) G S', i/ie third equality follows from the identity 1^-w^ = 
and the fourth equality is using the assumed translational invariance of the interaction pa¬ 
rameters. Thereby the proof is complete. 


S.3 Proof of translational invariance for <j) 

Theorem 3 An MRF x defined on a rectangular lattice S = {(i,j)]i = 0,. . .,n — l,j = 
0,... ,m — 1} with torus boundary conditions and £ m given in (5) is stationary if and only 
if <f x = for all A G £ and ( t,u ) G S. We then say that <f x is translational invariant. 

Proof of Theorem 3 Given the result in Theorem 2 it is sufficient to show that 0 A is 
translational invariant if and only if (3 X is translational invariant. We first assume (3 X to be 
translational invariant for all A G £ and need to show that then also </> A must be translational 
invariant for all A G £. Starting with (jS2l) . repeatedly using the specific form we are using 


3 


for C m and the assumed translational invariance for f3 x , we get for any A G £, (t,u) G S 


,u) _ 


r* A©(t,lt) 

L-'m 


_ yy 0(Ae(i,u))nA + yy ^(Ae+i+nA 
A*CA©(t,u) Ae£ m 


A®(t,u) 


1^ 

1 

W 

1 

I 


yy p\*®(t,u) _ y+ 


A*CA 


Ae£„ 


Aec 

£u))n(A©(t,u)) _|_ y ^ 

ag£A 


t,M))n(A©(t,u)) 


E + -E 0(AnA)®(t,u) + y^ 0(AnA)®(t, 


«0 


A*CA 


E' 3 "- E 0 (AnA)e(t,n) 

A €.£rn\£'m 


A*CA 


From this we can use induction on |A| to show that <p x is translational invariant for all A G £. 
Setting A = 0 we get 0 A ®(*> U ) = (1/1|)/3® which is clearly not a function of ( t,u ). T/ien 
assuming 0 A ®(*> U ) = (f x for all A G £ with |A| < o, considering the above relation for a A with 
|A| =o+l, and observing that |A D A| < o w/ien |A| = o + 1 and A G £ m \ £^ ; we get that 
also (f) x ®d’ u ) — (j) x f or | A | = o + 1, and the induction proof is complete. 

Next we assume </> A to be translational invariant for all A G £ and need to show that 
then also /3 X is translational invariant for all A G £. Starting with (I53j) . using the assumed 
translational invariance of (3 X , and again repeatedly using the specific form of C m , we get 


®(t,u) _ y ^ 


t , u))nA _ - E /3 a* 

A*cAffi(t,u) 

yy ^(A®(t,u))n(Aw(t,u)) _ yy ^x *®(t,u) 


Ae£ 


Ae£ n 


yy 0(AnA)®(t,u)) _ yy p. 


A*cA 

A*ffi(t,u) 


Ae£ n 


A*cA 


Using this we can easily use induction on |A| to s/iow that we must have fj x ®( l ’ u ) = /3 X , The 
proof is thereby complete. 
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S.4 Details for the MCMC sampling algorithm 


In this section we provide details of the proposal distributions that we use when sampling 
from the posterior distribution 

p(z\x) oc p(x\z)p(z), 

where p{x\z) and p(z) are the MRF and the prior given in the paper, respectively. 

To simulate from this posterior distribution we adopt a reversible jump Markov chain 
Monte Carlo (RJMCMC) algorithm with three types of updates. The first update type 
uses a random walk proposal for one of the tp parameters, the second proposes to move one 
configuration set to a new group, and the third proposes to change the number of groups, r, in 
the partition of the configuration sets. In the following we describe the proposal mechanisms 
for each of the three update types. The corresponding acceptance probabilities are given by 
standard formulas. It should be noted that only the last type of proposal produces a change 
in the dimension of the parameter space. 

5.4.1 Random walk proposal for parameter values 

The first proposal in our algorithm is simply to propose a new value for an already existing 
parameter using a random walk proposal, but correcting for the fact that the parameters 
should sum to zero. More precisely, we first draw a change £ ~ N(0,<r 2 ), where a 2 is an 
algorithmic tuning parameter. Second, we uniformly draw one element from the current 
state z = {( Ci , (pi),i — 1,..., r}, (Cj, ipf say, and define the potential new state as 

= = - M + !»■ ■ > r } u { (Ci^Pi + e- ^ |. 

5.4.2 Proposing to change the group for one configuration set 

Letting the current state be z = {(Cj,</?j),i = 1,..., r}, we start this proposal by drawing 
a pair of groups, C* and Cj say, where the first set Ci is restricted to include at least two 
configuration sets. We draw Ci and Cj so that the difference between the corresponding 
parameter values, pi — (pj, tend to be small. More precisely, we draw (i,j) from the joint 


5 



distribution 


{ exp (—(ifi — ipj ) 2 ) if i 7 ^ j and group C % contains at least two configuration sets, 
0 otherwise. 

Thereafter we draw uniformly at random one of the configuration sets in C t , c say. Our 
potential new state is then obtained by moving c from C t to C r Thus, our potential new 
state becomes 


2* = (z \ {( Q , (fit), ( Cj , ipj)}) U {{Ci \ c, (Cj U c, (pj)} . 

S.4.3 Trans-dimensional proposals 

Let again the current state be z = {(Cj, </?*), i — 1,..., r}. In the following we describe how 
we propose a new state by either increasing or reducing the number of groups, r, with one. 
There will be a one-to-one transition in the proposal, meaning that the opposite proposal, 
going from the new state to the old state has a non-zero probability. We make no attempt 
to jump between states where the difference between the dimensions are larger than one. 

First we draw whether to increase or to decrease the number of groups. If the number of 
groups are equal to the number of configurations sets, no proposal to increase the number of 
groups can be made due to the fact that empty groups have zero prior probability. In that 
case we propose to decrease the number of dimensions with probability 1. In our proposals 
we also make the restriction that only groupings containing at least one group with only one 
configuration set can be subject to a dimension reducing proposal. In a case where no such 
group exists, a proposal of increasing the number of dimensions are made with probability 1. 
In a case where both proposals are allowed we draw at random which to do with probability 
1/2 for each. Note that at least one of the two proposals is always valid. 

We now explain how to propose to increase the number of groups by one. We start by 
drawing uniformly at random one of the groups with more than one configuration set, C* 
say, which we want to split into two new groups. Thereafter we draw uniformly at random 
one of the configuration sets in C t , c say, and form a new partition of the configuration sets 
by extracting c from Ci and adding a new group containing only c. Next we need to draw a 
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parameter value for the new group {c}, and the parameter values for the other groups also 
need to be modified for the proposal to conform with the requirement that the sum of the 
(proposed) parameters should equal zero. We do this by first drawing a change e ~ N(0,cr 2 ) 
in the parameter value for c, where a 2 is the same tuning parameter as in the random walk 
proposal. We then define the potential new state as 

- yy-j-(<ft + e)^ , j = 1,. • -,i ~ M + 1,. • .,r-1 U 
(^Ci \ c, ip t - ^-j-y (<^ + <Pi + £- y^yy {ipi + e) 

Next we explain the proposal we make when the dimension is to be decreased by one. 
Since we need a one-to-one transition in our proposals, we get certain restrictions for these 
proposals. In particular, the fact that only groupings containing at least one group with only 
one configuration set are possible outcomes from a dimension increasing proposal dictates 
that dimension decreasing proposals only can be made from such groupings. Assume again 
our current model to be z = {(C*, <Pi),i = 1,..., r}, where at least one group contains only 
one configuration set. The strategy is to propose to merge one group consisting of only one 
configuration set into another group. As in Section IS. 4. 21 we draw the two configuration sets 
to be merged so that the difference between the corresponding parameter values tend to be 
small. More precisely, we let the two groups be Ci and C 3 where (i , j ) is sampled according 
to the joint distribution 

{ exp {—{'•Pi — <Pj ) 2 ) if i ^ j and Ci consists of only one configuration set, 

0 otherwise. 

Next we need to specify potential new parameter values. As these must conform with how 
we generated potential new values in the split proposal, we have no freedom left in how to 
do this. The potential new state must be 

2* = j (c k ,p k + yyjy, k G {1,..., r} \ {i, j}| U j [cj U C h ipj + ^-y <p^j j . 

The split and merge steps produce a change in the dimension of the parameter space, so 

to calculate the acceptance probabilities for such proposals we need corresponding Jacobi 

determinants. It is straightforward to show that the Jacobi determinants for the merge and 

split proposals become and respectively. 
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S.5 The independence model with check of convergence 


Consider a model where the variables are all independent of each other and p(xij) = p Xi ’ j (l - 
pY ~ Xi ,3 for each (i,j) G S and where p is the probability of x tJ being equal to 1. We get 


p(x) = p Xi>j ( 1 — p ) 1 Xi ’ j oc exp J a ^ 


x. 


M i 


(S4) 


(O')es 


h,i)G5 


where 


a = In 


P 


1 — p 


In this section we use the independence model as an example, and in particular we fit an MRF 
with 2x2 maximal cliques to data simulated from this model. Therefore it is helpful to know 
how one can represent the independence model using 2x2 maximal cliques, and this can be 
done using the same strategy that was used for the Ising model in Section 2.2 in the paper. 
We get 0° = r), 0 1 = f + p, 0 11 = 0 1 = 0 1 = 0 1 = | + 77 , 0 1 = 0 1 = 0 11 = 0 11 = ^ + 77 
and 0 11 = a + p, where 77 is an arbitrary value coming from the arbitrary value for j3 v . If 
p = 0.5 we see that a = 0 and all the configuration set parameters are equal. 

We generate a realisation from the independence model with p = 0.3 on a 100 x 100 
lattice, consider this as our observed data x, and simulate by the MCMC algorithm defined 
in Section 4 from the resulting posterior distribution. Using the notation for the configuration 
sets in a 2 x 2 maximal clique and the results above, we ideally want our algorithm to produce 
realisations with the groups {c 0 }, {c 1 }, {c 11 ,c 1 ,c 1 ,c 1 }, {c 1 , c x , c 11 , c 11 } and {c 11 }. Note 
that due to our identihability restriction in ( 11 ) the configuration set parameters should be 
close to the solution above with p = —a/2. To check convergence we investigated trace plots 
of various statistics, see Figure IS. 5. 11 and the conclusion was that the algorithm converges 
very quickly. The acceptance rate for the parameter value proposals is 24%, whereas the 

Figure IS. 5. II approximately here. 


acceptance rates for the other two types of proposals are both around 2%. We run our 
sampling algorithm for 20000 iterations, and estimate the posterior probability of the number 
of groups. The configuration sets are organised into 4 (77%), 5 (21%) or 6 (2%) groups, so for 
these data the grouping tends to be a little bit too strong compared to the correct number of 







groups. This can also be seen from the estimated posterior probability of two configuration 
sets being assigned to the same group, shown in Figure IS. 5. 21 This figure suggests the 

Figure IS. 5. 21 approximately here. 


ii in n l i n 

four groups {c u }, {c 1 }, {c , c 1 , c 1 ,c 1 , c 1 }, {c 1 , c 11 , c 11 , c 11 } which is also calculated to be 

the most probable grouping estimated by counting the number of occurrences. In fact the 

posterior probability for this grouping is as high as 55%. In Figure IS. 5. 21 we also see how the 

most probable grouping differ from the correct model grouping, shown in grey. The group 
n n i l 

{c 1 , c 1 , c 11 , c 11 } in the correct model is split in two in the most probable grouping, and the 

n n 1 1 1 i l 

subsets {c and {c 1 ,c n ,c n } are inserted into the correct model groups {c , c4,c l ,c l } 

n 

and {c 11 }, respectively. 

As in the Ising model example in Section 5.1 we estimate the posterior distribution for 
the interaction parameters, see Figure IS. 5. 31 As we can see, the true value of the interaction 

Figure IS. 5. 31 approximately here. 


parameters are mostly within the credibility intervals, but the tendency to group the config¬ 
urations too much is in this case forcing some of the true values into a tail of the marginal 
posterior distributions. 

As in the Ising example we compare the distribution of the same six statistics from 
simulations from our 2x2 model with posterior samples of z, the independence model with 
correct parameter value, and the independence model with parameter value obtained by 
posterior sampling, see Figure IS - . 5.41 As we can see, our model captures approximately the 

Figure 15.5.41 approximately here. 


correct distribution of the chosen statistics. It is interesting to note that for some statistics 
the realisations from the independence model with simulated a values follows our model 
tightly whereas for the other statistics it is close to the correct model. 

Also for this data set we investigated the case where 7 = 0 and 7 = 1 . For 7 = 0 the 
configuration sets are organised into 4 (75%), 5 (23%) or 6 (2%) groups, and for 7 = 1 we 
get 4 (93%), 5 (6%), 6 (1%) groups. As expected we again see the tendency towards stronger 
grouping when 7 is increased. 
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We also did experiments were the value of p was changed. If the value of p is close to 0.5 
the tendency to group the configurations too much becomes stronger. This makes perfectly 
sense, since the correct grouping for p = 0.5 is to put all configuration sets into only one 
group. In the other end, choosing p closer to 0 or 1 gives a stronger tendency to group the 
configurations according to the correct solution. This illustrates the fact that the algorithm 
tries to find a good model for the data using as few groups as possible, but as the difference 
between the true parameter values of the groups becomes larger the price to pay for choosing 
a model with fewer parameters increases. 

S.6 Red deer data with 3x3 maximal cliques 

In this section we present some results when assuming maximal clique size of 3 x 3 for the red 
deer data set presented in Section 5.2 in the paper. The main drawback with our approach 
is computational time, which is very dependent on the approximation parameter v. One also 
needs to keep in mind that even data from simple models will need many groups in the 3x3 
case to be modelled correctly. For instance, for the independence model the 401 configuration 
sets would need to be separated into 10 groups, while for the Ising model one would need 11 
groups to get the correct model grouping. Similarly, the posterior most probable grouping 
found for the 2x2 case for the reed deer example would need 38 groups to be modelled 
in the 3x3 case. Thus it is important not to assume larger maximal cliques than needed. 
However for this data set it is possible to run the sampling algorithm with 3x3 clique, even 
though this is computationally expensive. 

To get convergence we need a small generalisation to the proposal distribution for the 
trans-dimensional sampling step presented in Section IS. 4.31 In particular we allow for several 
configuration sets to be split, out into a new group at a single proposal, and correspondingly 
allow for the possibility of several configuration sets to be merge into another group in one 
single proposal. The estimated marginal distribution of the number of groups is 1%, 65%, 
33%, and 1% for 29, 30, 31 and 32 groups respectively. Three realisations from the likelihood 
for three randomly chosen realisations of z are shown in Figure 15.6.1( a). and comparing with 
the realisations for the 2x2 case, see Figure 8 in the paper, it is hard to see any differences 
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in the spatial structure of the realisations. 


Figure IS. 6. II approximately here. 

We also investigated the distribution of three statistics for 5000 realisation from the 
likelihood of each of the two clique sizes, see Figure IS.G.lf b). and it appears to be little 
difference also here. These results indicate that 2x2 maximal cliques might have sufficient 
complexity to explain this data set. 

S.7 Parallelisation of the sampling algorithm 

Most of the computing time for running our sampling algorithm is used to evaluate the 
likelihood in (2). In order to reduce the running time we adopt a scheme that do multiple 
updates of the Markov chain by evaluating likelihoods in parallel. 

Assume we are in a state z and propose to split/merge into a new state Z\. Now there 
are two possible outcomes for this proposal. Either we reject the proposal, which results in 
state z, or we accept the proposal, which results in state Z\. Either way we always propose 
a parameter update in the next step, and proposing this step from both the two states z 
and Z\ before evaluating the acceptance probability for the split/merge step is possible. The 
possible outcomes for these three proposals are z, Z \, z 2 and z 12 , where z is the outcome 
where neither the split/merge proposal nor the following parameter proposal is accepted, Z\ 
is the outcome where the split/merge proposal is accepted but not the following parameter 
proposal, z 2 is the outcome where the split/merge proposal is not accepted but the parameter 
proposal is, and Z \ 2 is the outcome where both the split/merge proposal and the following 
parameter proposal are accepted. If we continue the argument we can do the same to propose 
updates where configurations are moved from one group to another group, and in the red 
deer example we even include a level where updates of covariates are proposed. After making 
all proposals we evaluate the likelihoods for each possible state in parallel. The result is that 
we do need to evaluate too many likelihoods, but if the number of CPUs that are available 
is larger than or equal to the number of likelihoods we need to evaluate, a computational 
gain close to the number of levels is obtained. The updating scheme is illustrated in Figure 

EEE 
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Figure IS. 7. II approximately here. 




Figure S.5.1: Independence model example: Trace plots for the first quarter of the posterior 
simulation run. Solid curves are the result from a simulation where the initial number of 
groups is 1, and dashed curves are from a run with an initial value of 11 (maximal) number 
of groups. 
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Figure S.5.2: Independence model example: Estimated posterior probabilities for two 
configuration sets to be grouped together. The correct grouping is shown in grey, and only 
probabilities larger than 5% are given. 
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(a) /3 d (-1.08,-0.80). (b) /3™ (-0.05,0.22). 
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(c) /B (-0.05,0.23). (d) /ft (-0.11,0.12). 
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Figure S.5.3: Independence model example: Estimated marginal posterior distribution of 

the interaction parameters. True values are shown with a black dot and estimated 95% 

credibility interval is given for each parameter. 
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(c) g{x) = Ei, ^horizontal adjacent sites J (^ = X o) ( d ) fl(*) = E A G £ m 1 (*A = [88] ) 
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(e) s(*) = EAe£ m 1 ( x a = [10]) (f) g{x) = T,Aec m 1 ( x a = [10]) 

Figure S.5.4: Independence model example: Distribution of six statistics of realisations 

from our 2x2 model with posterior samples of z (solid), the independence model with 

correct parameter value (dashed), and the independence model with posterior samples of the 

parameter value (dotted). The data evaluated with each statistic is shown with a black dot. 
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(a) Three realisations from the likelihood for three random samples of z from the posterior distri¬ 
bution. 
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(b) Distribution of three functions of realisations from the likelihood with 3x3 cliques (solid) 
and 2x2 cliques (dashed). The three functions are g(x) = ^ = jjjjjj ) (left), g(x) = 

E Ae ,C m 1 (*A = H) (middle), and g(x) = E Ae£ m 1 (*A = [}jg]) (right). 

Figure S.6.1: Red deer 3x3 example: Posterior results with 7 = 0.5. 



Figure S.7.1: Proposal scheme for parallel likelihood evaluations. Starting in model z, 
proposals are made down the graph. Arrows pointing straight down represents rejection of 
proposal while arrow pointing down and left represent acceptance. Double squares are used 
to represents states where a new likelihood evaluation is needed. 
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